SRI International Results February 1992 ATIS Benchmark Test

Authors

  • Douglas E. Appelt
  • Eric Jackson
Abstract

We describe the results that SRI International achieved on the February 1992 ATIS Speech and Natural Language System Test. The basic architecture of the system is described, including a set of parameters capable of altering the system's behavior and processing strategy. We report on several experiments that were run on the February test set to evaluate several processing strategies for both natural-language-only and full spoken-language system tests.

1. INTRODUCTION

This paper reports on the results of running SRI International's spoken-language system on the DARPA-sponsored February 1992 test. The system's natural-language processing has been parameterized in several ways to achieve different behaviors. In addition to running our system with what we believed at the time of the test to be the optimal parameter settings to produce our official results, we have conducted some experiments by running the system with a variety of parameter settings. The results of these experiments shed some light on the trade-offs among various SLS and natural-language processing strategies, and provide some interesting data for evaluating the evaluation methodology itself.

2. SYSTEM DESCRIPTION

The SLS system used for the February evaluation is an integration of the SRI DECIPHER speech recognition system [1,4,5] with the SRI TRAVELOGUE natural-language processing system. The integration between these two systems is currently accomplished by a simple serial interface: the best acoustic hypothesis is processed by the NL system to produce the answer to the query.

The DECIPHER System

DECIPHER is a speaker-independent continuous-speech recognition system based on tied-mixture hidden Markov models (HMMs). It uses six features: three vectors (cepstra, delta-cepstra, and delta-delta-cepstra) and three scalars (energy, delta-energy, and delta-delta-energy).
These features are computed from a filter bank that is derived via an FFT and high-pass filtered (RASTA filtered) in the log-spectral-energy domain. DECIPHER models pronunciation variability through word networks that are generated by linguistic rules and then pruned probabilistically. There are cross-word acoustic and phonological models. Parallel recognizers were implemented and trained separately on male and female speech.

The DECIPHER-ATIS system uses a backed-off bigram language model to reduce the perplexity of the input speech. The acoustic models were trained on all available ATIS spontaneous and read data (excluding 809 sentences used for system development, which include 362 October 1991 dry-run sentences and 447 MADCOW sentences). The backed-off bigram language model was trained on the available ATIS spontaneous speech data. This included 14,779 sentences (approximately 150,000 words). The recognition lexicon consisted of all words spoken in all available spontaneous ATIS data. There are also lexical entries for breaths and silence. No catch-all rejection model was used for out-of-vocabulary items. The vocabulary size is 1385 words.

The TRAVELOGUE System

The TRAVELOGUE system consists of a template-matching sentence-analysis mechanism [3] coupled with a context-handling mechanism and a database query generation component. The template matcher operates by producing templates from the input sentence, which then get translated into database queries. The two main components of a template are the template type, which generally corresponds to a relation in the underlying database, and a set of filled slots, which represent constraints present in the query. A template for the sentence "Show me the nonstop flights from Boston" might be of the type "flight" and have an origin slot filled with "Boston" and a stops slot filled with "0."
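
The two template components just described can be pictured with a small data structure. The following is a minimal sketch, not TRAVELOGUE's actual representation: the class name `Template`, the method `to_query`, and the SQL-like output format are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """Hypothetical container for one template (names are illustrative)."""
    template_type: str                                    # usually names a database relation
    slots: dict[str, str] = field(default_factory=dict)   # slot name -> constraint value

    def to_query(self) -> str:
        """Render an illustrative SQL-like string (an assumption, not the real query generator)."""
        where = " AND ".join(f"{name} = '{value}'" for name, value in sorted(self.slots.items()))
        query = f"SELECT * FROM {self.template_type}"
        return f"{query} WHERE {where}" if where else query


# Template for "Show me the nonstop flights from Boston", as in the text above.
example = Template("flight", {"origin": "Boston", "stops": "0"})
print(example.to_query())
```

Keeping the slots as a plain mapping makes the correspondence between filled slots and query constraints direct; a real system would also need to map slot names onto actual database columns.
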
In addition to these components, a template contains an illocutionary force marker (e.g., "show," "how many," "yes/no"), and a list of explicitly requested fields from the relation associated with the template type. There are 20 different template types and 110 distinct slots.

The template matcher determines the type of template by looking for certain key nouns or key phrases in the sentence. It incorporates a simple noun phrase grammar that allows it to identify phrases containing key nouns. The presence of a key noun in certain contexts (e.g., in a noun phrase preceded by a word like "show") will more strongly trigger the associated template type than an isolated occurrence of that key noun. Conjunctions of noun phrases containing key nouns produce templates with multiple template types.

Slots are filled by matching regular-expression patterns against the input string. For example, "from" followed by an airport or city name may fill the origin slot of the flight template. To find fillers for slots, the template matcher makes use of a lexicon of names and codes, each associated with the appropriate sort, and special grammars for recognizing numbers, dates, and times.

For each template type with some key noun or key phrase present in the sentence, the system tries to find the best "slot covering" of the sentence it can. That is, it tries to find the sequence of slot-filling patterns that matches the sentence and consumes as many words as possible. Two constraints are (1) slot-filling phrases may not overlap, and (2) no slot may be filled twice with different values. The system incorporates a schematic mapping of the domain, which contains the information as to how entities are related, and allows the system to determine what slots are possible for each template.

In the next stage, the system chooses a single template from the set of candidate templates that have been constructed.
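
The slot-filling and slot-covering steps described above can be sketched as follows. This is an illustration under stated assumptions, not the TRAVELOGUE implementation: the three patterns, the tiny city list standing in for the lexicon, the function names, and the exhaustive search strategy are all hypothetical. Only the two constraints (no overlapping phrases, no slot filled twice with different values) and the goal of consuming as many words as possible come from the text.

```python
import re

# Hypothetical stand-in for the lexicon of names and codes.
CITY = r"(boston|denver|dallas)"

# (slot name, pattern, function turning a match into the slot value)
SLOT_PATTERNS = [
    ("origin", re.compile(rf"\bfrom {CITY}\b"), lambda m: m.group(1)),
    ("destination", re.compile(rf"\bto {CITY}\b"), lambda m: m.group(1)),
    ("stops", re.compile(r"\bnonstop\b"), lambda m: "0"),
]


def candidate_fills(text):
    """Return every (start, end, slot, value) pattern match, sorted by position."""
    fills = []
    for slot, pattern, value_of in SLOT_PATTERNS:
        for m in pattern.finditer(text):
            fills.append((m.start(), m.end(), slot, value_of(m)))
    return sorted(fills)


def best_covering(sentence):
    """Exhaustively search for the consistent set of fills consuming the most words."""
    text = sentence.lower()
    fills = candidate_fills(text)
    best = (0, {})

    def search(i, last_end, slots, consumed):
        nonlocal best
        if consumed > best[0]:
            best = (consumed, dict(slots))
        for j in range(i, len(fills)):
            start, end, slot, value = fills[j]
            if start < last_end:
                continue  # constraint 1: slot-filling phrases may not overlap
            if slot in slots and slots[slot] != value:
                continue  # constraint 2: no slot filled twice with different values
            newly_added = slot not in slots
            slots[slot] = value
            search(j + 1, end, slots, consumed + len(text[start:end].split()))
            if newly_added:
                del slots[slot]

    search(0, 0, {}, 0)
    return best


words, slots = best_covering("Show me the nonstop flights from Boston to Denver")
print(words, slots)
```

Exhaustive search is affordable here only because a single sentence yields a handful of candidate fills; the source does not say how the real system searches for the best covering.
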
The choice is based on several factors, including the type of key that triggered the template and the number of words consumed in filling slots. A template score is then computed for the chosen template, reflecting the proportion of words in the sentence that are considered to be consumed. Words that fill slots or help slots get filled count, as well as function words and certain other words (such as "please") that are ignored for the purposes of scoring. If the template does not score above a threshold, the system chooses not to risk answering the query. The threshold can be varied depending on how much risk of a wrong answer can be tolerated. For evaluation we have found a threshold of about 0.85 to be optimal, while for data collection we use a lower threshold, typically 0.5.

The template matcher incorporates special mechanisms to handle certain types of false starts and complex conjunctions. These phenomena cannot be handled well in a straightforward, unaugmented, template-matching ap-
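
The scoring and thresholding step described above can be illustrated with a short sketch. Several details are assumptions: the function names, the small set of ignorable words, and the choice to treat ignorable words as consumed (the text could also be read as excluding them from the count altogether). Only the proportion-of-consumed-words idea and the 0.85 and 0.5 thresholds come from the text.

```python
# Hypothetical list of function words and other words that never count against the score.
IGNORABLE = frozenset({"please", "the", "me", "a"})


def template_score(words, slot_words, ignorable=IGNORABLE):
    """Proportion of words considered consumed (assumes ignorable words count as consumed)."""
    if not words:
        return 0.0
    counted = sum(1 for w in words if w in slot_words or w in ignorable)
    return counted / len(words)


def should_answer(score, threshold=0.85):
    """Answer only when the score is above the threshold, as the text describes."""
    return score > threshold


sentence = "show me the nonstop flights from boston on a big plane".split()
# Words assumed to fill slots or help slots get filled for this sentence.
slot_words = {"show", "nonstop", "flights", "from", "boston"}

score = template_score(sentence, slot_words)
print(round(score, 3), should_answer(score), should_answer(score, threshold=0.5))
```

With these assumed word sets, 8 of the 11 words are counted, so the query would be rejected at the evaluation threshold of 0.85 but answered at the data-collection threshold of 0.5.
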

Similar resources

A Real-Time Spoken-Language System for Interactive Problem Solving

SRI has developed a spoken language system to retrieve air travel planning information. Progress can be measured by comparing DARPA benchmark results in February 1992 and November 1992. Between February 1992 and November 1992, for all utterances tested, SRI's word error rate in the ATIS speech recognition test improved from 11.0% to 9.1%. Weighted utterance error improved from 31.1% to 23.6% in...

DARPA February 1992 ATIS Benchmark Test Results

This paper documents the third in a series of Benchmark Tests for the DARPA Air Travel Information System (ATIS) common task domain. The first results in this series were reported at the June 1990 Speech and Natural Language Workshop [1], and the second at the February 1991 Speech and Natural Language Workshop [2]. The February 1992 Benchmark Tests include: (1) ATIS domain spontaneous speech re...

NIST-ARPA Interagency Agreement: Human Language Technology Program

PROJECT GOALS 1. To coordinate the design, development and distribution of speech and natural language corpora for the ARPA Spoken Language research community, and the use of these corpora for technology development and evaluation. 2. To design, coordinate the implementation of, and analyze the results of performance assessment benchmark tests for ARPA's speech recognition and spoken language u...

NIST-DARPA Interagency Agreement: Spoken Language Program

1. To coordinate the design, development and distribution of speech and natural language corpora for the DARPA Spoken Language research community. 2. To design, coordinate implementation, and analyze results, of performance assessment "benchmark tests" for DARPA's speech recognition and spoken language understanding systems. 1. Completed production of the six-CD-ROM-set for ATIS0, and made this...

Session 3: Spoken Language Systems III

Since February 91, the ATIS database has been updated and, in an effort to quickly collect a larger amount of training and test data, a concerted effort has taken place in collecting data at five different sites (AT&T, BBN, CMU, MIT, and SRI). (See [1] for details.) About 10,000 spontaneous utterances were collected, of which about half were annotated (text transcriptions, reference answers, et...


Publication date: 1992